Abstract

Traditional disease surveillance systems suffer from several disadvantages, including reporting lags and antiquated technology, that have caused a movement towards internet-based disease surveillance systems. Internet systems are particularly attractive for disease outbreaks because they can provide data in near real-time and can be verified by individuals around the globe. However, most existing systems have focused on disease monitoring and do not provide a data repository for policy makers or researchers. In order to fill this gap, we analyzed Wikipedia article content.

We demonstrate how a named-entity recognizer can be trained to tag case counts, death counts, and hospitalization counts in the article narrative, achieving an F1 score of 0.753. We also show, using the 2014 West African Ebola virus disease epidemic article as a case study, that there are detailed time series data that are consistently updated and that closely align with ground truth data.

We argue that Wikipedia can be used to create the first community-driven open-source emerging disease detection, monitoring, and repository system.


Introduction 


Most traditional disease surveillance systems rely on data from patient visits or lab records (Losos 1996; Burkhead and Maylahn 2000; Adams et al. 2013). These systems, while generally recognized to contain accurate information, rely on a hierarchy of public health systems that causes reporting lags of up to 1-2 weeks in many cases (Burkhead and Maylahn 2000). Additionally, many regions of the world lack the infrastructure necessary for these systems to produce reliable and trustworthy data. Recently, in an effort to overcome these issues, timely global approaches to disease surveillance have been devised using internet-based
data. Data sources such as search engine queries, [...], and Wikipedia access logs (Generous et al. 2014) have been shown to be effective in this arena.

A notably different internet-based disease surveillance tool is HealthMap (Freifeld et al. 2008). HealthMap analyzes, in real-time, data from a variety of sources (e.g., ProMED-mail (Madoff 2004), Google News, the World Health Organization) in order to allow simple querying, filtering, and visualization of outbreaks past and present. During emerging outbreaks, HealthMap is often used to understand the current state (e.g., incidence and death counts, outbreak locations). For example, HealthMap was able to detect the 2014 Ebola epidemic nine days before the World Health Organization (WHO) officially announced it (Greenemeier 2014).

While HealthMap has certainly been influential in the digital disease detection sphere, it has some drawbacks. First and foremost, it runs on source code that is not open and relies on certain data sources that are not freely available in their entirety (e.g., Moreover Newsdesk¹). Some argue that there is a genuine need for open source code and open data in order to validate, replicate, and improve existing systems (Generous et al. 2014). They argue that while certain closed source services, such as HealthMap and Google Flu Trends (Ginsberg et al. 2009), are popular and useful to the public, there is no way for the public to contribute to the service or continue the service, should the owners decide to shut it down. For example, Google offers a companion site to Google Flu Trends, Google Dengue Trends. However, since Google’s source code and data are closed, it is not possible for anyone outside of Google to create similar systems for other diseases, e.g., Google Ebola Trends. Additionally, it is not possible for anyone outside of the HealthMap development team to add new features or data sources to HealthMap. For these reasons, Generous et al. argue for the use of Wikipedia access logs coupled with open source code for digital disease surveillance.

Much richer Wikipedia data are available, however, than just access logs. The entire Wikipedia article content and edit histories are available, complete with edit history metadata (e.g., timestamps of edits and IP addresses of anonymous editors). A plethora of open media (audio, images, and video) are also available.

¹ http://www.moreover.com/

Wikipedia has a history of being edited and used, in many cases, in near real-time during unfolding news events. Keegan et al. have been particularly instrumental in understanding Wikipedia’s dynamics during unfolding breaking news events, such as natural disasters and political conflicts and scandals (Keegan, Gergle, and Contractor 2011; Keegan, Gergle, and Contractor 2013; Keegan 2013). They have provided insight into editor networks as well as editing activity during news events. Recognizing that Wikipedia might offer useful disease data during unfolding epidemiological events, this study presents a novel use of Wikipedia article content and edit history in which disease data (i.e., case, death, and hospitalization counts) are elicited in a timely fashion.

We study two different aspects of Wikipedia content as it relates to unfolding disease events:

1. Using standard natural language processing (NLP) techniques, we demonstrate how to capture case counts, death counts, and hospitalization counts from the article text.

2. Using the 2014 West African Ebola virus epidemic article as a case study, we show there are valuable time series data present in the tables found in certain articles.

We argue that Wikipedia data can be used not only for disease surveillance but also as a centralized repository system for collecting disease-related data in near real-time.


Methods 

Disease-related information can be found in a number of places on Wikipedia. We demonstrate how two aspects of Wikipedia article content (historical changes to article text and tabular content) can be harvested for disease surveillance purposes. We first show how a named-entity recognizer can be trained to elicit “important” phrases from outbreak articles, and we then study the accuracy of tabular time series data found in certain articles using the 2014 West African Ebola epidemic as a case study.


Wikipedia data 

Wikipedia is an open collaborative encyclopedia consisting of approximately 30 million articles across 287 languages (Wikimedia Foundation 2014f; Wikimedia Foundation 2014g). The English edition of Wikipedia is by far the largest and most active edition; it alone contains approximately 4.7 million articles, while the next largest Wikipedia edition (Swedish) contains only 1.9 million articles (Wikimedia Foundation 2014g). The textual content of the current revision of each English Wikipedia article totals approximately 10 gigabytes (Wikimedia Foundation 2014d).

One of Wikipedia’s primary attractions to researchers is its openness. All of the historical article content, dating back to Wikipedia’s inception in 2001, is available to anyone free of charge. Wikipedia content can be acquired through two means: a) Wikipedia’s official web API³ or b) downloadable database dumps.⁴ Although the analysis in this study could have been done offline using the downloadable database dumps, this option is in practice difficult, as the database dumps containing all historical English article revisions are very large (multiple terabytes when uncompressed) (Wikimedia Foundation 2014h). We therefore decided to use Wikipedia’s web API, caching content when appropriate.

³ http://www.mediawiki.org/wiki/API:Main_page
⁴ http://dumps.wikimedia.org/enwiki/latest/
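As an illustration of the web API route, the following is a minimal sketch of paging through one article's revision history. It is not the code used in this study: the endpoint URL, the User-Agent string, the selected `rvprop` fields, and the function names `http_fetch` and `iter_revisions` are assumptions made for the example, and it targets the MediaWiki Action API as it exists today. The `fetch` parameter exists only so that a cache or a test double can be swapped in for the network call.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for the English Wikipedia Action API.
API_URL = "https://en.wikipedia.org/w/api.php"


def http_fetch(params):
    """Send one GET request to the API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url, headers={"User-Agent": "revision-history-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_revisions(title, fetch=http_fetch):
    """Yield revision metadata for one article, oldest first.

    The API returns results in batches; each reply that has more data
    carries a "continue" object whose fields are merged into the next query.
    """
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|comment",
        "rvlimit": "max",
        "rvdir": "newer",
    }
    while True:
        reply = fetch(params)
        for page in reply.get("query", {}).get("pages", {}).values():
            yield from page.get("revisions", [])
        if "continue" not in reply:
            break
        params = {**params, **reply["continue"]}
```

Calling `list(iter_revisions("Ebola virus epidemic in West Africa"))` would issue live requests; wrapping `http_fetch` in a cache is one way to approximate the "caching content when appropriate" mentioned above.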

Wikipedia contains many articles on specific disease outbreaks and epidemics (e.g., the 2014 West Africa Ebola epidemic⁵ and the 2012 Middle East Respiratory Syndrome Coronavirus (MERS-CoV) outbreak⁶). We identified two key aspects of Wikipedia disease outbreak articles that can aid disease surveillance efforts: a) key phrases in the article text and b) tabular content. Most outbreak articles we surveyed contained dates, locations, case counts, death counts, case fatality rates, demographics, and hospitalization counts in the text. These data are, in general, swiftly updated as new data become available. Perhaps most importantly, sources are often provided so that external review can occur. The following two excerpts came from the articles on the 2012 MERS-CoV outbreak and 2014 Ebola epidemic, respectively:


On 16 April 2014, Malaysia reported its first MERS-COV related death. The person was a 54 year-old man who had traveled to Jeddah, Saudi Arabia, together with pilgrimage group composed of 18 people, from 15-28 March 2014. He became ill by 4 April, and sought remedy at a clinic in Johor on 7 April. He was hospitalized by 9 April and died on 13 April. (Wikimedia Foundation 2014a)


On 31 March, the U.S. Centers for Disease Control and Prevention (CDC) sent a five-person team to assist Guinea’s Ministry of Health and the WHO to lead an international response to the Ebola outbreak. On that date, the WHO reported 112 suspected and confirmed cases including 70 deaths. Two cases were reported from Liberia of people who had recently traveled to Guinea, and suspected cases in Liberia and Sierra Leone were being investigated. On 30 April, Guinea’s Ministry of Health reported 221 suspected and confirmed cases including 146 deaths. The cases included 25 health care workers with 16 deaths. By late May, the outbreak had spread to Conakry, Guinea’s capital, a city of about two million inhabitants. On 28 May, the total cases reported had reached 281 with 186 deaths. (Wikimedia Foundation 2014b)


Although most outbreak articles contain content similar to the above examples, not all outbreak articles on Wikipedia contain tabular data. The tabular data that do exist, though, are often consistently updated. For example, Figure 1 presents a screenshot of a table taken from the 2014 Ebola epidemic article. This table contains case counts and death counts for all regions of the world affected by the epidemic, complete with references for the source data. The time granularity is irregular, but updated counts are consistently provided every 2-5 days.

⁵ http://en.wikipedia.org/wiki/Ebola_virus_epidemic_in_West_Africa
⁶ http://en.wikipedia.org/wiki/2012_Middle_East_respiratory_syndrome_coronavirus_outbreak

Figure 1: Table containing updated worldwide Ebola case counts and death counts. This is a screenshot taken directly from the 2014 Ebola epidemic Wikipedia article (Wikimedia Foundation 2014b). Time granularity is irregular but is in general every 2-5 days. References are also provided for all data points.

While there are certainly other aspects of Wikipedia article content that can be leveraged for disease surveillance purposes, these are the two we focus on in this study. The following sections detail the data extraction methods we use.


Named-entity recognition 

In order to recognize certain key phrases in the Wikipedia article narrative, we trained a named-entity recognizer (NER). Named-entity recognition is a task commonly used in natural language processing (NLP) to identify and categorize certain key phrases in text (e.g., names, locations, dates, organizations). NERs are sequence labelers; that is, they label sequences of words. Consider the following example (Wikimedia Foundation 2014e):

Jim bought 300 shares of Acme Corp. in 2006. 

Entities in this example could be named as follows: 


[Jim]PERSON bought 300 shares of [Acme Corp.]ORGANIZATION in [2006]TIME.


This study specifically uses Stanford’s NER (Finkel, Grenager, and Manning 2005).⁷ The Stanford NER is an implementation of a conditional random field (CRF) model (Sutton 2011). CRFs are probabilistic statistical models that are the discriminative analog of hidden Markov models (HMMs). Generative models, such as HMMs, learn the joint probability $p(x, y)$, while discriminative models, such as CRFs, learn the conditional probability $p(y \mid x)$. In practice, this means that generative models like HMMs classify by modeling the actual distribution of each class, while discriminative models like CRFs classify by modeling the boundaries between classes. In most cases, discriminative models outperform generative models (Ng and Jordan 2002).

⁷ http://nlp.stanford.edu/software/CRF-NER.shtml

While Stanford’s NER includes models capable of recognizing common named entities, such as PERSON, ORGANIZATION, and LOCATION, it also provides the capability for us to train our own model so that we can capture new types of named entities we are interested in. For this specific task, we were interested in automatically identifying three entity types: a) DEATHS, b) INFECTIONS, and c) HOSPITALIZATIONS. Our trained model should therefore be able to automatically tag phrases that correspond to these three entities in the text documents it receives as input.

NERs possess the ability to learn and generalize in order to identify unseen phrase patterns. Because the classifier depends on the features we provide to it (e.g., words, part-of-speech tags) rather than on fixed surface patterns, we expect it to generalize to unseen instances. A more simplistic pattern-matching approach, such as regular expressions, is not practical due to inherent variation. For example, the following phrases from our dataset all contain INFECTIONS entities:

1. ... a total of 17 patients with confirmed H7N9 virus infection ...

2. ... there were only sixty-five cases and four deaths ...

3. ... more than 16,000 cases were being treated ...

Example 1 has the pattern [number] patients, while examples 2 and 3 follow the pattern [number] cases. However, example 2 spells out the number, while example 3 provides the numeral. A simple regular expression cannot capture the variability found in our dataset; we would need to define dozens of regular expressions for each entity type, and the rigidity of regular expressions would limit the likelihood that we would be able to identify entities in new unseen patterns.
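The rigidity described above can be seen with a small sketch. The pattern below is a hypothetical first attempt, not one used in this study: it looks for a numeral followed by "cases" or "patients", and it is applied to the three example phrases.

```python
import re

# Hypothetical first-attempt pattern: a numeral (with optional thousands
# separators) followed by the word "cases" or "patients".
NUMERAL_PATTERN = re.compile(r"\b(\d[\d,]*)\s+(?:cases|patients)\b")

EXAMPLES = [
    "a total of 17 patients with confirmed H7N9 virus infection",
    "there were only sixty-five cases and four deaths",
    "more than 16,000 cases were being treated",
]


def numeral_matches(text):
    """Return the numerals captured by NUMERAL_PATTERN (empty list if none)."""
    return NUMERAL_PATTERN.findall(text)


for phrase in EXAMPLES:
    print(phrase, "->", numeral_matches(phrase))
```

The pattern finds the numerals in examples 1 and 3 but returns nothing for example 2, where the count is spelled out. Covering that case, and every other phrasing that appears in practice, would require a growing list of hand-written patterns.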

A number of steps were required to prepare the data for 
annotation so that the NER could be trained: 

1. We first queried Wikipedia’s API in order to get the complete revision history for the articles used in our training set.

2. We cleaned each revision by stripping all MediaWiki markup from the text, as well as removing tables.

3. We computed the diff (i.e., textual changes) between successive pairs of article revisions. This provided lines deleted and added between the two article revisions. We retained a list of all the line additions across all article revisions.

4. Many lines in this resulting list were similar to one another (e.g., “There are 45 new cases.” → “There are 56 new cases.”). For the purposes of training the NER, it is not necessary to retain highly similar or identical lines. We therefore split each line into sentences and removed similar sentences by computing the Jaccard similarity between each sentence using trigrams as the constituent parts in the Jaccard equation. The Jaccard similarity equation for measuring the similarity between two sets A and B, defined as $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, is commonly used for near-duplicate detection (Manning, Raghavan, and Schütze 2009). We only kept sentences for which the similarity with all the distinct sentences retained so far was no greater than 0.75.

5. We split each line into tokens in order to create a tab-separated value file that is compatible with Stanford’s NER.

6. Finally, we used Stanford’s part-of-speech (POS) tagger⁸ (Toutanova et al. 2003) to add a POS feature to each token.
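Steps 3 and 4 above can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: the text does not say whether the trigrams are built from characters or words, so character trigrams are assumed here, and the function names `added_lines`, `trigrams`, `jaccard`, and `dedupe` are invented for the example. The diff uses Python's standard `difflib` module.

```python
import difflib


def added_lines(old_text, new_text):
    """Return the lines present in new_text but not in old_text (step 3)."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff prefixes added lines with "+ "; "- ", "  " and "? " lines are skipped.
    return [line[2:] for line in diff if line.startswith("+ ")]


def trigrams(sentence):
    """Return the set of character trigrams in a sentence (assumed unit)."""
    return {sentence[i:i + 3] for i in range(len(sentence) - 2)}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sentences' trigram sets."""
    set_a, set_b = trigrams(a), trigrams(b)
    if not set_a and not set_b:
        return 1.0  # two sentences too short to have trigrams are treated as equal
    return len(set_a & set_b) / len(set_a | set_b)


def dedupe(sentences, threshold=0.75):
    """Keep a sentence only if it is no more than `threshold` similar
    to every sentence retained so far (step 4)."""
    kept = []
    for sentence in sentences:
        if all(jaccard(sentence, other) <= threshold for other in kept):
            kept.append(sentence)
    return kept
```

The greedy pass in `dedupe` compares each candidate against everything already kept, which is quadratic in the number of sentences; that is workable for a training set of this size but would need an index for a much larger corpus.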


In order to train the NER, we annotated a dataset derived from the following 14 Wikipedia articles generated according to the above methodology: a) Ebola virus epidemic in West Africa,⁹ b) Haiti cholera outbreak,¹⁰ c) 2012 Middle East respiratory syndrome coronavirus outbreak,¹¹ d) New England Compounding Center meningitis outbreak,¹² e) Influenza A virus subtype H7N9,¹³ f) 2013-14 chikungunya outbreak,¹⁴ g) Chikungunya outbreaks,¹⁵ h) Dengue fever outbreaks,¹⁶ i) 2013 dengue outbreak in Singapore,¹⁷ j) 2011 dengue outbreak in Pakistan,¹⁸ k) 2009-10 West African meningitis outbreak,¹⁹ l) Mumps outbreaks in the 21st century,²⁰ m) Zimbabwean cholera outbreak,²¹ and n) 2006 dengue outbreak in India.²² The entire cleaned and annotated dataset contained approximately 55,000 tokens. The inside-outside-beginning (IOB) scheme, popularized in part by the CoNLL-2003 shared task on language-independent named-entity recognition (Tjong Kim Sang and De Meulder 2003), was used to tag each token. The IOB scheme offers the ability to tie together sequences of tokens that make up an entity.
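To make the IOB scheme concrete, here is a minimal sketch that turns entity spans into per-token tags. The helper `iob_tags`, the tokenization, and the choice to annotate "16,000 cases" as the INFECTIONS span are assumptions made for illustration; the annotation rules actually used in this study are not reproduced here.

```python
def iob_tags(tokens, spans):
    """Label tokens with IOB tags.

    `spans` is a list of (start, end, entity) tuples with `end` exclusive.
    The first token of a span gets "B-<entity>", the remaining tokens get
    "I-<entity>", and every token outside a span gets "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, entity in spans:
        tags[start] = "B-" + entity
        for i in range(start + 1, end):
            tags[i] = "I-" + entity
    return tags


tokens = ["more", "than", "16,000", "cases", "were", "being", "treated"]
tags = iob_tags(tokens, [(2, 4, "INFECTIONS")])

# One token and its tag per line, tab-separated.
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

The B-/I- distinction is what lets the tagger tie "16,000" and "cases" together as a single entity instead of two unrelated tagged words.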

⁸ http://nlp.stanford.edu/software/tagger.shtml
⁹ http://en.wikipedia.org/wiki/Ebola_virus_epidemic_in_West_Africa
¹⁰ http://en.wikipedia.org/wiki/Haiti_cholera_outbreak
¹¹ http://en.wikipedia.org/wiki/2012_Middle_East_respiratory_syndrome_coronavirus_outbreak
¹² http://en.wikipedia.org/wiki/New_England_Compounding_Center_meningitis_outbreak
¹³ http://en.wikipedia.org/wiki/Influenza_A_virus_subtype_H7N9
¹⁴ http://en.wikipedia.org/wiki/2013%E2%80%9314_chikungunya_outbreak
¹⁵ http://en.wikipedia.org/wiki/Chikungunya_outbreaks
¹⁶ http://en.wikipedia.org/wiki/Dengue_fever_outbreaks
¹⁷ http://en.wikipedia.org/wiki/2013_dengue_outbreak_in_Singapore
¹⁸ http://en.wikipedia.org/wiki/2011_dengue_outbreak_in_Pakistan
¹⁹ http://en.wikipedia.org/wiki/2009%E2%80%9310_West_African_meningitis_outbreak
²⁰ http://en.wikipedia.org/wiki/Mumps_outbreaks_in_the_21st_century
²¹ http://en.wikipedia.org/wiki/Zimbabwean_cholera_outbreak
²² http://en.wikipedia.org/wiki/2006_dengue_outbreak_in_India

The annotation task was split between two annotators (the first and second authors). In order to tune inter-annotator agreement, the annotators each annotated three sets of 5,000 tokens. After each set of annotations, differences were identified, and clarifications to the annotation rules were made. The third set resulted in a Cohen’s kappa coefficient of 0.937, indicating high agreement between the annotators.
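For readers unfamiliar with Cohen's kappa, the sketch below computes it for two annotators' token labels using the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function name and the toy labels are illustrative only and are not the study's annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of tokens given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy example: the annotators disagree on one of four tokens.
annotator_1 = ["B-DEATHS", "B-DEATHS", "O", "O"]
annotator_2 = ["B-DEATHS", "O", "O", "O"]
print(cohens_kappa(annotator_1, annotator_2))
```

Because most tokens in running text are tagged O, raw percent agreement would look high even for careless annotation; kappa discounts that chance agreement, which is why it is the more informative figure here.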


Tabular data 

To understand the viability of tabular data in Wikipedia, we concentrate on the Ebola virus epidemic in West Africa article.²³ We chose this article for two reasons. First, the epidemic is still unfolding, which makes it a concern for epidemiologists worldwide. Second, the epidemiological community has consistently updated the article as new developments are publicized. Ideally, we would analyze all disease articles that contain tabular data, but the technical challenges surrounding parsing the constantly changing data leave this as future work.

Ebola is a rare but deadly virus that first appeared in 1976 simultaneously in two different remote villages in Africa. Outbreaks of Ebola virus disease (EVD), previously known as Ebola hemorrhagic fever (EHF), are sporadic and generally short-lived. The average case fatality rate is 50%, but it has varied between 25% and 90% in previous outbreaks. EVD is transmitted to humans from animals (most commonly, bats, apes, and monkeys) and also from other humans through direct contact with blood and body fluids. Signs and symptoms appear within 2-21 days of exposure (average 8-10 days) and include fever, severe headache, muscle pain, weakness, diarrhea, vomiting, abdominal pain, and unexplained bleeding or bruising. Although there is currently no known cure, treatment in the form of aggressive rehydration seems to improve survival rates (World Health Organization 2014a; Centers for Disease Control and Prevention 2014).

The West African EVD epidemic was officially announced by the WHO on March 25, 2014 (World Health Organization 2014b). The disease spread rapidly and has proven difficult to contain in several regions of Africa. At the time of this writing, it has spread to 7 different countries (including two outside of Africa): Guinea, Liberia, Sierra Leone, Nigeria, Senegal, United States, and Spain.

The Wikipedia article was created on March 29, 2014, four days after the WHO announced the epidemic (Wikimedia Foundation 2014c). As seen in Figure 1, this article contains detailed tables of case counts and death counts by country. The article is regularly updated by the Wikipedia community (see Figure 2); over the 165-day period analyzed, the article averaged approximately 31 revisions per day.

Figure 2: The number of revisions made each day to the 2014 Ebola virus epidemic in West Africa Wikipedia article (http://en.wikipedia.org/wiki/Ebola_virus_epidemic_in_West_Africa). A total of 5,137 revisions were made over the 165-day period analyzed.

²³ http://en.wikipedia.org/wiki/Ebola_virus_epidemic_in_West_Africa
²⁴ http://johnmacfarlane.net/pandoc/

We parsed the Ebola article’s tables in several steps:

1. We first queried Wikipedia’s API to get the complete revision history for the West African EVD epidemic article. Our initial dataset contained 5,137 revisions from March 29, 2014 to October 14, 2014.

2. We then parsed each revision to pull out its case count and death count time series. To parse the tables, we first used pandoc²⁴ to convert the MediaWiki markup to consistently formatted HTML and then used BeautifulSoup²⁵ to parse the HTML. Because the Wikipedia time series contain a number of missing data points prior to June 30, 2014, we use this date for the beginning of our analysis; time series data prior to June 30, 2014 are not used in this study. The resulting dataset contained 3,803 time series.

3. As Figure 1 shows, there are non-regular gaps in the Wikipedia time series; these gaps range from 2-5 days. We used linear interpolation to fill in missing data points where necessary so that we have daily time series. Daily time series data simplify comparisons with ground truth data (described later).

4. Recognizing that the tables will not necessarily change between article revisions (i.e., an article revision might contain edits to only the text of the article, not to a table in the article), we then removed identical time series. This final dataset contained 39 time series.
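Steps 3 and 4 can be sketched with the standard library alone. The sketch below is illustrative, not the study's code: the names `interpolate_daily` and `drop_identical` are invented, the counts in the usage lines are made up, and it assumes that "identical time series" means a series equal to the one kept from the preceding revision.

```python
from datetime import date, timedelta


def interpolate_daily(observations):
    """Linearly interpolate a {date: count} mapping to one value per day.

    Values are filled in for every day between the first and last observed
    date; nothing is extrapolated outside that range.
    """
    days = sorted(observations)
    if not days:
        return {}
    daily = {}
    for left, right in zip(days, days[1:]):
        span = (right - left).days
        for offset in range(span):
            fraction = offset / span
            daily[left + timedelta(days=offset)] = (
                observations[left]
                + fraction * (observations[right] - observations[left])
            )
    daily[days[-1]] = observations[days[-1]]
    return daily


def drop_identical(series_list):
    """Keep a revision's series only if it differs from the last one kept."""
    kept = []
    for series in series_list:
        if not kept or series != kept[-1]:
            kept.append(series)
    return kept


# Made-up counts reported two days apart.
observed = {date(2014, 7, 1): 10, date(2014, 7, 3): 20}
daily = interpolate_daily(observed)
print(daily[date(2014, 7, 2)])
```

Interpolated values are generally fractional, which is harmless for an error metric computed over the whole series but should be kept in mind if the filled-in counts are ever displayed as case numbers.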

Results 

Named-entity recognition 

To test the classifier’s performance, we averaged precision, recall, and F1 score results from 10-fold cross-validation. Table 1 demonstrates a typical confusion matrix used to bin cross-validation results, which are then used to compute precision, recall, and the F1 score. Precision asks, “Out of all the examples the classifier labeled, what fraction were correct?” and is computed as $\frac{TP}{TP + FP}$. Recall asks, “Out of all labeled examples, what fraction did the classifier recognize?” and is computed as $\frac{TP}{TP + FN}$. The F1 score is the harmonic mean of precision and recall: $2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$. All three scores range from 0 to 1, where 0 is the worst score possible and 1 is the best score possible.

²⁵ http://www.crummy.com/software/BeautifulSoup/


Table 1: Typical classifier confusion matrix.

                 Ground truth positive   Ground truth negative
Test positive    True positive (TP)      False positive (FP)
Test negative    False negative (FN)     True negative (TN)
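The three scores follow directly from the counts in the confusion matrix. The sketch below uses made-up counts, not results from this study, and the function name is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts.

    Each score falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up counts: 8 true positives, 2 false positives, 4 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Note that averaging F1 over cross-validation folds is not the same as computing F1 from the averaged precision and recall, so fold-averaged figures need not satisfy the harmonic-mean identity exactly.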


Table 2: Classifier performance determined from 10-fold cross-validation.

maxNGramLeng   Precision   Recall   F1 score
1              0.820       0.693    0.747
2              0.810       0.690    0.740
3              0.815       0.702    0.750
4              0.814       0.709    0.753
5              0.813       0.709    0.753
6              0.812       0.710    0.753
7              0.812       0.706    0.751
8              0.814       0.708    0.753
9              0.815       0.707    0.753
10             0.815       0.708    0.753
11             0.813       0.708    0.753
12             0.811       0.709    0.752


Table 2 shows these results as we varied the maxNGramLeng option (Stanford’s default value is 6). The maxNGramLeng option determines sequence length when training. We were somewhat surprised to discover that larger maxNGramLeng values did not improve the performance of the classifier, indicating that more training data are likely necessary to further improve the classifier. Furthermore, roughly maximal performance is achieved with maxNGramLeng = 4; there is no tangible benefit to larger sequences (despite this, we concentrate on the maxNGramLeng = 6 case since it is the default). Our 14-article training set achieved precision of 0.812 and recall of 0.710, with an average F1 score of 0.753, for maxNGramLeng = 6.

For maxNGramLeng = 6, Table 3 shows the average precision, recall, and F1 scores for each of the named entities we annotated (DEATHS, INFECTIONS, and HOSPITALIZATIONS). There were a total of 264 DEATHS, 633 INFECTIONS, and 16 HOSPITALIZATIONS entities annotated across the entire training dataset. Recall that we used the IOB scheme for annotating sequences; this is reflected in Table 3, with B-* indicating the beginning of a sequence and I-* indicating the inside of a sequence. It is generally the case that identifying the beginning of a sequence is easier than identifying all of the inside words of a sequence; the only exception to this is HOSPITALIZATIONS, but we speculate that the identical beginning and inside results for this entity are due to the relatively small sample size.

Table 3: Classifier performance for each of the entities we used in our annotations.

Named entity         Precision   Recall   F1 score
B-Deaths             0.888       0.744    0.802
I-Deaths             0.821       0.730    0.764
B-Infections         0.812       0.719    0.756
I-Infections         0.762       0.714    0.730
B-Hospitalizations   0.933       0.833    0.853
I-Hospitalizations   0.933       0.833    0.853

Tabular data

To compute the accuracy of the Wikipedia West African EVD epidemic time series, we used Caitlin Rivers’ crowdsourced Ebola data.²⁶ Her country-level data come from official WHO data and reports. As with the Wikipedia time series, we used linear interpolation to fill in missing data where necessary so that the ground truth data are specified daily; this ensured that the Wikipedia and ground truth time series were specified at the same granularity. Note that the time granularity of the WHO-based ground truth dataset is generally finer than that of the Wikipedia data; the gaps in the ground truth time series were not the same as those in the Wikipedia time series. In many cases, the ground truth data were updated every 1-2 days.

We compared the 39 Wikipedia epidemic time series to the ground truth data by computing the root-mean-square error (RMSE). We use the RMSE rather than the mean-square error (MSE) because the testing and ground truth time series both have the same units (cases or deaths); when they have the same units, the computed RMSE also has the same unit, which makes it easily interpretable. The RMSE,

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\hat{Y}_i - Y_i\right)^2},$$

measures the average difference in the number of cases or deaths between a Wikipedia epidemic time series ($\hat{Y}$) and the ground truth time series ($Y$) over the $n$ daily data points compared. Figure 3 shows how the case time series and death time series RMSE changes with each table revision for each country. Of particular interest is the large spike in Figure 3a on July 8, 2014 in Liberia and Sierra Leone. Shortly after the 6:27pm spike, an edit from a different user at 8:16pm the same day with edit summary “correct numbers in wrong country columns” corrected the error.
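The RMSE comparison between a Wikipedia series and the ground truth series, once both are on the same daily grid, reduces to a few lines. The function name and the numbers below are illustrative only.

```python
import math


def rmse(estimates, truth):
    """Root-mean-square error between two equally long numeric sequences."""
    if len(estimates) != len(truth) or not truth:
        raise ValueError("sequences must be non-empty and equally long")
    squared_errors = ((e - t) ** 2 for e, t in zip(estimates, truth))
    return math.sqrt(sum(squared_errors) / len(truth))


# Made-up daily case counts for one country.
wikipedia_series = [100, 110, 125, 140]
ground_truth = [100, 112, 125, 137]
print(rmse(wikipedia_series, ground_truth))
```

Because the errors are squared before averaging, a single badly wrong day (such as numbers entered in the wrong country column) raises the RMSE sharply, which is consistent with the isolated spikes described above.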

The average RMSE values for each country’s time series are listed in Table 4. Even in the worst case, the average deviation between the Wikipedia time series and the ground truth is approximately 19 cases and 12 deaths. Considering the magnitude of the number of cases (e.g., approximately 1,500 in Liberia and 3,500 in Sierra Leone during the time period considered) and deaths (e.g., approximately 850 in Liberia and 1,200 in Sierra Leone), the Wikipedia time series are generally within 1-2% of the ground truth data.


²⁶ https://github.com/cmrivers/ebola

Figure 3: Root-mean-square error (RMSE) values for the cases and deaths time series are shown for each revision where the tables changed. The RMSE spikes on July 8, 2014 (Liberia and Sierra Leone) and August 20, 2014 (Liberia) in 3a were due to Wikipedia contributor errors and were fixed shortly after they were made. Most RMSE spikes are quickly followed by a decrease; this is due to updated WHO data or contributor error detection.

Table 4: Average cases and deaths RMSE across all table revisions.

Country         Mean Cases RMSE   Mean Deaths RMSE
Guinea          3.790             2.701
Liberia         18.168            11.983
Nigeria         0.310             0.189
Senegal         0.403             0.008
Sierra Leone    18.847            12.015
Spain           18.243            0.050
United States   0.174             0.000


Conclusions

Internet data are becoming increasingly important for disease surveillance because they address some of the existing challenges, such as the reporting lags inherent in traditional disease surveillance data, and they can also be used to detect and monitor emerging diseases. Additionally, internet data can simplify global disease data collection. Collecting disease data is a formidable task that often requires browsing websites written in an unfamiliar language, and data are specified in a number of formats ranging from well-formed spreadsheets to unparseable PDF files containing low resolution images of tables. Although several popular internet-based systems exist to help overcome some of these traditional disease surveillance system weaknesses, most notably HealthMap (Freifeld et al. 2008) and Google Flu Trends (Ginsberg et al. 2009), no such system exists that relies solely on open data and runs using 100% open source code.

Previous work explored Wikipedia access logs to tackle some of the disadvantages traditional disease surveillance systems face (McIver and Brownstein 2014; Generous et al. 2014). This study explores a new facet of Wikipedia: the content of disease-related articles. We present methods on how to elicit data that can potentially be used for near-real-time disease surveillance purposes. We argue that in some instances, Wikipedia may be viewed as a centralized crowdsourced data repository.

First, we demonstrate using a named-entity recognizer (NER) how case counts, death counts, and hospitalization counts can be tagged in the article narrative. Our NER, trained on a dataset derived from 14 Wikipedia articles on disease outbreaks/epidemics, achieved an F1 score of 0.753, evidence that this method is fully capable of recognizing these entities in text. Second, we analyzed the quality of tabular data available in the 2014 West Africa Ebola virus disease article. By computing the root-mean-square error (RMSE), we show that the Wikipedia time series very closely align with WHO-based ground truth data.

There are many future directions for this work. First and foremost, more training data are necessary for an operational system in order to improve precision and recall. There are many more disease- and outbreak-related Wikipedia articles that can be annotated. Additionally, other open data sources, such as ProMED-mail, might be used to enhance the model. Second, a thorough analysis of the quality and correctness of the entities tagged by the NER is needed. This study presents the methods by which disease-related named entities can be recognized, but we have not thoroughly studied the correctness and timeliness of the data. Third, our analysis of tabular data consisted of a single article. A more rigorous study looking at the quality of tabular data in more articles is necessary. Finally, the work presented here considers only the English Wikipedia. NERs are capable of tagging entities in a variety of other languages; more work is needed to understand the quality of data available in the 286 non-English Wikipedias.

There are several limitations to this work. First, the ground truth time series we used to compute RMSEs is static, while the Wikipedia time series vary over time. Because the relatively recent static ground truth time series may contain corrections for reporting errors made earlier in the epidemic, the RMSE values may be artificially inflated in some instances. Second, we are ignoring the user-provided edit summary. This edit summary provides information about why the edit was made. The edit summary identifies article vandalism (and subsequent vandalism reversion) as well as content corrections and updates. Taking these edit summaries into account can further improve model performance (e.g., processing edit summaries would allow us to disregard the erroneous edit that caused the July 8, 2014 spike in Figure 3a).

Ultimately, we envision this work being incorporated into a community-driven open-source emerging disease detection and monitoring system. Wikipedia access log time series gauge public interest and, in many cases, correlate very well with disease incidence. A community-driven effort to improve global disease surveillance data is imminent, and Wikipedia can play a crucial role in realizing this need.